“Everything is related to everything else, but near things are more related than distant things.”
# Libraries
library(tidyverse)
library(spdep)
library(rgdal)
library(sf)
library(sp)
library(rgeos)
library(terra)
library(DT)
colours = colorRampPalette(c("#fedb71","#FCD471","#facd71","#f6bf70","#eea26f","#EA946F","#e6866e","#e2786e","#e0716e","#dd696d"))
The necessary data for this type of analysis are:
# Read municipality data
tn <- readOGR("../data/aggregated_data_per_municipality", encoding="UTF-8")
## OGR data source with driver: ESRI Shapefile
## Source: "G:\Il mio Drive\2nd Year\Geospatial\TrentinoSchools\data\aggregated_data_per_municipality", layer: "aggregated_data_per_municipality"
## with 166 features
## It has 16 fields
## Integer64 fields read as strings: Pop under Pop_mat Pop_ele Pop_med Pop_sup Popolazion
# Setting the CRS
tn <- spTransform(tn, CRS("+init=epsg:4326"))
# Function to replace NAs with 0
na.zero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
names(tn@data)[3] = "Scuole"
names(tn@data)[7] = "Media_stud_classe"
names(tn@data)[8] = "Media_stud_scuola"
names(tn@data)[9] = "Pop_under_20"
names(tn@data)[15] = "Pop_under_20_Pop_tot"
names(tn@data)[16] = "Stud_Pop_under_20"
# Replace NAs about municipalities' number of schools with 0s
tn@data$Scuole = na.zero(tn@data$Scuole)
# Plot the municipalities in Trentino
par(mar=c(0,0,0,0))
plot(tn, axes = F, border="grey")
It is worth noticing from the map of school points that many of them overlap, especially in the Adige’s Valley and most populated areas, such as Trento, Rovereto and Riva del Garda.
# Import shapefile as SpatialPointsDataFrame
schools <- readOGR("../data/Trentino/schools/schools.shp",verbose = FALSE)
# Setting the CRS
schools <- spTransform(schools, CRS("+init=epsg:4326"))
# Plot schools over Trentino's map
par(mar=c(0,0,0,0))
plot(tn, border = "grey", axes = F)
points(schools@coords, col = "#f27059", cex = 1, pch = 1)
Before starting with the global analysis of Trentino municipalities, it is necessary to select some representative points for each municipality as unique reference to the spatial coordinate.
par(mar=c(0,0,0,0))
plot(tn, border="grey")
points(coordinates(tn),
col="red",
bg = "#EF798A",
pch = 21,
lwd = 1.5)
Commonly, the centroid is a good choice, but still some problems can emerge and the centroid may not fall inside the boundaries of the territory. In this particular case, this may happen with multipolygons shape, i.e. those municipalities represented by a multiple number of polygons that do not share a boundary. Some examples are the municipalities of Tione di Trento, Ronzone, Stenico, Calliano, Pellizzano and Riva del Garda. Some municipalities, as Luserna, find their centroid in another territory because of their shape.
The plot depicts all those municipalities whose centroid is outside their actual territory. As can be seen, most of them have little zones outside the main one, while others (e.g. Predaia and Luserna) just have awkward shapes.
tn_coords = coordinates(tn)
tn@data$centroid = tn_coords
colorize = c()
for(i in 1:166){
colorize = append(colorize, point.in.polygon(tn$centroid[i,1],
tn$centroid[i,2],
tn@polygons[i][[1]]@Polygons[[1]]@coords[,1],
tn@polygons[i][[1]]@Polygons[[1]]@coords[,2]))
}
par(mar = c(0,0,0,0))
plot(tn, col = ifelse(colorize, "white","lightgrey"), border="grey")
text(tn_coords, labels = ifelse(colorize, "" ,tn@data$Comune), cex=0.6)
Instead of recurring to coordinates(), we could use
gCentroid to obtain an alternative version of centroids.
For most of the cases, these two versions coincide, but for those
municipalities with multiple polygons points may differ (e.g. Soraga di
Fassa in the upper right part). Since it computes a sort of mean point,
making the centroid being part of other territories, the rest of the
notebook will relate to the previous version
(i.e. coordinates), despite inaccurate in some cases.
par(mar=c(0,0,0,0))
trueCentroids = gCentroid(tn,byid=TRUE)
plot(tn, border="grey")
points(coordinates(tn),pch=1, col="blue")
points(trueCentroids,pch=2, col="red")
⚠️ Note that, instead, school dataset contains points and not polygons, therefore no problem occurs with the centroid computation. Also, since many schools have same coordinates because in the same building, the centroids will only consider unique points, reducing the dataset from 724 to 599 unique points.
Since there are various definitions of neighbourhood, we will try to explore three of them in the following sections, starting from K-nearest neighbour.
Since there is no way to choose a specific value for \(k\), we can iterate over a customized range, let’s say from 1 to 20 neighbours (i.e. schools). The following code generates an image for each value of \(k\).
# Saving schools coordinates
school_coords = coordinates(schools)
# Save frame per frame
for(i in 1:20){
png(paste0("../viz/knn/",i,".png"),res=300, width=1000, height=1000)
k <- knn2nb(knearneigh(school_coords, k = i, longlat=T))
par(mar = c(0,0,0,0))
plot(tn, border = "grey80", axis = tn, lwd=0.5)
plot(k, school_coords, lwd=.6, col=alpha("#F27059",alpha=0.5),
cex = .3, add=TRUE, points=FALSE)
dev.off()
}
⚠️ Note that through the function saveGIF()4 it would
be possible to save frames as animation, lowering the resolution of the
image obtained. Saving each frame will also allow users to select a
customized value for \(k\) and look at
how the map changes inside the website.
KNN on Trentino Schools
As \(k\) is increased, each school finds more and more neighbours, approaching also further points. If we focus on the first frame with \(k=1\), we may notice that some schools are isolated, since their closest neighbour requires long lines to be reached. An example is the municipality of Vermiglio, in the upper left part of Trentino, in the western boundary. There are in fact \(3\) schools in Vermiglio and one of them is near the boundary, far from other schools. A similar situation happens to Rabbi and Malé, to Luserna and Lavarone, but still Vermiglio is the municipality with the longest distance between schools within the local territory.
k <- knn2nb(knearneigh(school_coords, k = 1, longlat=T))
par(mar = c(0,0,0,0))
plot(tn, border = "grey80", axis = tn, lwd=0.5)
plot(k, school_coords, lwd=.6,
col=alpha("#F27059",alpha=0.5),
cex = .3, add=TRUE, points=FALSE)
text(tn_coords, labels = ifelse(tn@data$Comune %in% c("Vermiglio","Rabbi","Malé","Luserna","Lavarone","Predaia"),tn@data$Comune, ""), cex=0.6)
On the other hand, at the extreme opposite, with \(k=20\), we obtain the network of schools in Trentino, looking at the \(20\) closest schools around each point. Trento and Rovereto are the most intertwined zones, while green areas as Adamello-Brenta Natural Park (west boundary) and Valsugana (empty zone in the right part of the map) lack in the number of schools. In fact, as can be seen by the length of lines, distances between schools tend to increase by moving from the Adige Valley to the east and west boundary.
KNN with k=20
If instead we focus on the municipalities, we can limit to a lower \(k\), since we talk about polygons with other territories at their boundary, while some schools points may result isolated with low values of \(k\). By looking at \(k=5\) plot, we may notice how all municipalities are linked to each other, despite it does not seem so in the territory above Borgo Valsugana, where Dolomites can be found.
knn1 = knn2nb(knearneigh(tn_coords, k = 1, longlat=T))
knn2 = knn2nb(knearneigh(tn_coords, k = 2, longlat=T))
knn3 = knn2nb(knearneigh(tn_coords, k = 3, longlat=T))
knn4 = knn2nb(knearneigh(tn_coords, k = 4, longlat=T))
knn5 = knn2nb(knearneigh(tn_coords, k = 5, longlat=T))
par(mar=c(0,0,0,0))
plot(tn, border = "grey80", axis = tn, lwd=0.5)
plot(knn5, tn_coords, lwd=.6, col=alpha("#F27059",alpha=0.5),
cex = .3, add=TRUE, points=FALSE)
Our aim in this section will be to find the minimum threshold distance which allows all regions/points to have at least one neighbour. By setting \(k=1\) in the k-nearest neighbour, we can first compute the nearest neighbour to each school and the relative distance and then get the maximum distance among them.
knn1<- knn2nb(knearneigh(school_coords,
k=1,
longlat=T))
all.linked <- max(unlist(nbdists(knn1,
school_coords,
longlat=T)))
all.linked
## [1] 9.182375
According to the results, all schools have a neighbour at at least 9.1823746 km. This implies that the cut-off distance has to be greater than it. However, notice from the following plot the distribution of school distances: the majority of them is near \(0\)km, following a long tail distribution. This may happen in big cities, as Trento and Rovereto, where there are a lot of schools and the minimum distance between them lowers.
distances = unlist(nbdists(knn1,school_coords,longlat=T))
ggplot()+
geom_histogram(aes(x=distances), fill='#F27059', bins=50)+
labs(title = "Distribution of distances between schools")+
theme_minimal()
knn1<- knn2nb(knearneigh(tn_coords,
k=1,
longlat=T))
all.linked <- max(unlist(nbdists(knn1,
tn_coords,
longlat=T)))
all.linked
## [1] 11.16466
We can repeat the same computation on municipalities centroids, discovering that every municipality has a neighbour at at least 11.1646609 km, slightly greater than the previous cut-off distance, which could mean that there are multiple schools in every municipality or that they are close enough to the boundary to be neighbour of other municipalities’ schools. Analyzing the distribution of distances between municipalities, we may notice that the majority of them distances from others from \(2\) to \(4\)km.
distances = unlist(nbdists(knn1,tn_coords,longlat=T))
ggplot()+
geom_histogram(aes(x=distances), fill='#F27059', bins=50)+
labs(title="Distribution of distances between schools")+
theme_minimal()
We can try different neighbourhood definitions for different values of the cut-off distance, starting from the minimum threshold found before (i.e. \(9.18\)km).
dnb10 <- dnearneigh(school_coords, 0, 10, longlat=TRUE); dnb10
## Neighbour list object:
## Number of regions: 724
## Number of nonzero links: 54326
## Percentage nonzero weights: 10.36408
## Average number of links: 75.03591
dnb15 <- dnearneigh(school_coords, 0, 15, longlat=TRUE); dnb15
## Neighbour list object:
## Number of regions: 724
## Number of nonzero links: 93048
## Percentage nonzero weights: 17.75129
## Average number of links: 128.5193
dnb20 <- dnearneigh(school_coords, 0, 20, longlat=TRUE); dnb20
## Neighbour list object:
## Number of regions: 724
## Number of nonzero links: 140056
## Percentage nonzero weights: 26.71927
## Average number of links: 193.4475
dnb25 <- dnearneigh(school_coords, 0, 25, longlat=TRUE); dnb25
## Neighbour list object:
## Number of regions: 724
## Number of nonzero links: 192010
## Percentage nonzero weights: 36.63083
## Average number of links: 265.2072
dnb30 <- dnearneigh(school_coords, 0, 30, longlat=TRUE); dnb30
## Neighbour list object:
## Number of regions: 724
## Number of nonzero links: 247346
## Percentage nonzero weights: 47.18759
## Average number of links: 341.6381
As the cut-off distance increases, the number of links grows rapidly. Based on the visualization, we could have stopped at 20, where nearly every school is connected to others.
plot_neighbour = function(model, coords, title){
par(mar=c(0,0,1,0))
plot(tn, border="grey",xlab="",ylab="",xlim=NULL)
title(main=title, cex.main=0.8)
plot(model, coords, add=TRUE, col="#F27059", pch=16, lwd = 1, points=FALSE)
}
par(mfrow = c(3,2))
plot_neighbour(dnb10, school_coords, "d nearest neighbours, d = 10")
plot_neighbour(dnb15, school_coords, "d nearest neighbours, d = 15")
plot_neighbour(dnb20, school_coords, "d nearest neighbours, d = 20")
plot_neighbour(dnb25, school_coords, "d nearest neighbours, d = 25")
plot_neighbour(dnb30, school_coords, "d nearest neighbours, d = 30")
The same approach could be applied to municipalities data, obviously creating a network based on the territories around a certain area. Remembering that the cut-off threshold in this case is above \(11\), we can start with \(12\).
dnb12 <- dnearneigh(tn_coords, 0, 12, longlat=TRUE); dnb12
## Neighbour list object:
## Number of regions: 166
## Number of nonzero links: 2000
## Percentage nonzero weights: 7.257947
## Average number of links: 12.04819
dnb16 <- dnearneigh(tn_coords, 0, 16, longlat=TRUE); dnb16
## Neighbour list object:
## Number of regions: 166
## Number of nonzero links: 3314
## Percentage nonzero weights: 12.02642
## Average number of links: 19.96386
dnb20 <- dnearneigh(tn_coords, 0, 20, longlat=TRUE); dnb20
## Neighbour list object:
## Number of regions: 166
## Number of nonzero links: 4866
## Percentage nonzero weights: 17.65859
## Average number of links: 29.31325
dnb24 <- dnearneigh(tn_coords, 0, 24, longlat=TRUE); dnb24
## Neighbour list object:
## Number of regions: 166
## Number of nonzero links: 6712
## Percentage nonzero weights: 24.35767
## Average number of links: 40.43373
dnb30 <- dnearneigh(tn_coords, 0, 30, longlat=TRUE); dnb30
## Neighbour list object:
## Number of regions: 166
## Number of nonzero links: 9652
## Percentage nonzero weights: 35.02685
## Average number of links: 58.14458
par(mfrow = c(3,2))
plot_neighbour(dnb12, tn_coords, "d nearest neighbours, d = 12")
plot_neighbour(dnb16, tn_coords, "d nearest neighbours, d = 16")
plot_neighbour(dnb20, tn_coords, "d nearest neighbours, d = 20")
plot_neighbour(dnb24, tn_coords, "d nearest neighbours, d = 24")
plot_neighbour(dnb30, tn_coords, "d nearest neighbours, d = 30")
Also in this case the number of connections grows rapidly, indicating how close the municipalities are between each other. Consider that the maximum distance within province of Trento between municipalities is around 120km (considering the centroids, therefore the actual distance is greater), but over the \(75\%\) of municipalities has an area below \(50km^2\), which allows them to be connected with brief distances.
# Quantiles of areas of municipalities within the Province of Trento
tn@data$area = round(area(tn)/ 1000000,3)
area = tn@data %>%
arrange(desc(area)) %>%
select(Comune, area)
quantile(area$area)
## 0% 25% 50% 75% 100%
## 1.65200 12.95100 25.57700 48.97075 199.55200
# Find max distance within the province centroids
library(geosphere)
## Warning: il pacchetto 'geosphere' è stato creato con R versione 4.1.2
diff = c()
for(i in 1:166 ){
for (j in 1:166){
diff = append(diff, distm(tn_coords[i,],tn_coords[j,], fun = distHaversine))
}
}
# Maximum distance between centroids within the province of Trento
max(diff)/1000
## [1] 120.7196
Since schools are represented as points, we will use municipalities data to connect territories with common boundary, i.e. multiple municipalities around. From the visualization it is worth noticing the spiderweb created around the municipalities on the inside of Trentino, while those more disconnected are placed on the border of the Province, especially the territories on the upper right part of the map (e.g. Canazei, San Giovanni di Fassa, Mazzin, Campitello di Fassa, Soraga di Fassa, Moena).
par(mar=c(0,0,0,0))
contnb_q <- poly2nb(tn, queen=T)
plot(tn, border="grey")
plot(contnb_q, tn_coords, add=TRUE, col="#EF798A")
points(coordinates(tn),
col="red",
bg = "#EF798A",
pch = 21,
lwd = 1.5)
⚠️ Note that there are 166 municipalities in the Province of Trento. By sorting them according to the shape area, we get that the biggest areas do not share at least one boundary, since they take place on the border or in mountainous zones; while Trento occupies a central position.
area%>%
DT::datatable()
After the definition of neighbourhoods, we can create spatial weights matrices, one for each neighbour list previously created.
# K-nearest neighbour
knn1.list = nb2listw(knn1)
knn2.list = nb2listw(knn2)
knn3.list = nb2listw(knn3)
knn4.list = nb2listw(knn4)
knn5.list = nb2listw(knn5)
# Critical cut-off
dnb12.list = nb2listw(dnb12,style="W")
dnb16.list = nb2listw(dnb16,style="W")
dnb20.list = nb2listw(dnb30,style="W")
dnb24.list = nb2listw(dnb24,style="W")
dnb30.list = nb2listw(dnb30,style="W")
# Contiguity based approach
contnb_q.list = nb2listw(contnb_q)
# List with weights lists and their name
weights = list(
list(knn1.list, "K-nearest neighbour (k=1)"),
list(knn2.list, "K-nearest neighbour (k=2)"),
list(knn3.list, "K-nearest neighbour (k=3)"),
list(knn4.list, "K-nearest neighbour (k=4)"),
list(knn5.list, "K-nearest neighbour (k=5)"),
list(dnb12.list, "Critical cut-off neighbourhood (d=12)"),
list(dnb16.list, "Critical cut-off neighbourhood (d=16)"),
list(dnb20.list, "Critical cut-off neighbourhood (d=20)"),
list(dnb24.list, "Critical cut-off neighbourhood (d=24)"),
list(dnb30.list, "Critical cut-off neighbourhood (d=30)"),
list(contnb_q.list, "Contiguity-based neighbourhoord")
)
In the section, we will focus on Moran’s I test of spatial autocorrelation of Trentino Schools, in particular on the number of schools and students that populate every municipality. Let’s start plotting the distribution of some features over the territory.
⚠️ Note that we do not hold data about students of all schools in Trentino, therefore some data might be missing. In these cases, the plots below will show the respective area in white.
cols = list(tn$Scuole,tn$Studenti, tn$Classi, tn$Media_stud_scuola, tn$Pop_under_20_Pop_tot, tn$Stud_Pop_under_20, tn$Media_stud_classe)
titles = c("Schools","Students","Classes","Mean of students per school","Population under 20 over Total Population", "Students over Population under 20", "Mean students per class")
colours <- c("#fedb71","#FCD471","#facd71","#f6bf70","#eea26f","#EA946F","#e6866e","#e2786e","#e0716e","#dd696d")
na.ignore = function(x){
x[is.na(x)] <- -1
return(x)
}
par(mfrow=c(4,2),mar = c(0,0,1.7,0))
for(i in 1:7){
c = na.ignore(unlist(cols[i]))
brks <- round(quantile(c, seq(0,1,0.1)), digits=3)
plot(tn, col=ifelse(c==-1,
"#ffffff",
colours[findInterval(c, brks, all.inside=TRUE)]),
main = titles[i], cex.main=2.5)
}
Now we can try to compute the Moran’s test based on all the previous definitions of neighborhood and with the previous features exposed, trying to find some spatial autocorrelation.
Neighbourhood = c()
Column = c()
Sd = c()
p_value = c()
Moran_I_statistic = c()
Mean = c()
Var = c()
Assumption = c()
# Iterate over columns
for(i in 1:length(cols)){
# Iterate over neighbourhood
for (w in weights) {
c = na.zero(unlist(cols[i]))
# Iterate over assumptions
for (rand in c(T,F)) {
Neighbourhood = append(Neighbourhood, w[[2]])
res = moran.test(c, w[[1]], randomisation = rand)
Column = append(Column, titles[i])
Sd = append(Sd, round(as.numeric(res[[1]]),4))
p_value = append(p_value, round(as.numeric(res[[2]]), 4))
Moran_I_statistic = append(Moran_I_statistic, round(as.numeric(res[[3]][1]),4))
Mean = append(Mean, round(as.numeric(res[[3]][2]),4))
Var = append(Var, round(as.numeric(res[[3]][3]),4))
if(rand) {
Assumption = append(Assumption, "Randomization")
}else{
Assumption = append(Assumption, "Normality")
}
}
Neighbourhood = append(Neighbourhood, w[[2]])
res = moran.mc(c, w[[1]], nsim=999)
Column = append(Column, titles[i])
Sd = append(Sd,round(as.numeric(res[[1]]),4))
p_value = append(p_value, round(res$p.value),4)
Moran_I_statistic = append(Moran_I_statistic, round(res$statistic,4))
Mean = append(Mean, "")
Var = append(Var, "")
Assumption = append(Assumption, res$method)
}
}
# create df with results and show them
res_df = data.frame(Column, Neighbourhood, Moran_I_statistic, p_value,
Sd, Mean, Var, Assumption)
res_df %>%
arrange(desc(abs(Moran_I_statistic)), p_value) %>%
DT::datatable()
By ordering results according to the absolute value of the Moran’s I statistics, we can see that the highest statistics obtained are around the \([0.1789, 0.1952]\) interval, got in all three assumptions. Albeit mean students per school seem to be the column with highest Moran’s statistics, the p-value is not below the threshold of \(0.05\) for the majority of its observations. In fact, the mean of students per school results significative with neighbourhoods knn (k=5) and critical cut-off with d=16.
On the other hand, the proportion of Population under 20 over the total population seems significative with every configuration, except for some knns neighbourhoods. However, Moran’s statistics is lower than the ones gathered considering the mean of students per school.
Both these two columns show a positive spatial autocorrelation, while negative ones are associated to the number of Students over the Population under 20, with low p-value and minimum value \(-0.1296\) for Moran’s I statistics.
res_df%>%
group_by(Column) %>%
summarise("Median Moran" = median(Moran_I_statistic), "Median p-value" = median(p_value),
"Mean Moran" = round(mean(Moran_I_statistic),4), "Mean p-value" = round(mean(p_value),4)) %>%
DT::datatable()
Considering an aggregated table with median and mean statistics and p-value, we can confirm that the columns with highest statistics are:
The low mean and median p-values for Students confirms the absence of spatial autocorrelation based on the number of students. Nearly the same happens to the number of Classes and Schools.
Let’s start by trying to model the mean students per class with all the remaining features we have explored. The summary shows that all predictors are significant, except for the population under 20 over total population and the number of students.
LinearMean <- lm(Media_stud_classe ~ Stud_Pop_under_20+Scuole+Classi+Media_stud_scuola+Pop_under_20_Pop_tot+Studenti, tn, na.action = na.ignore)
summary(LinearMean)
##
## Call:
## lm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole +
## Classi + Media_stud_scuola + Pop_under_20_Pop_tot + Studenti,
## data = tn, na.action = na.ignore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4646 -1.1711 -0.0065 1.2772 7.8307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.844835 1.541773 3.142 0.002 **
## Stud_Pop_under_20 6.955752 0.425146 16.361 < 2e-16 ***
## Scuole 0.839792 0.133982 6.268 3.30e-09 ***
## Classi -0.376734 0.050486 -7.462 5.24e-12 ***
## Media_stud_scuola 0.040702 0.003932 10.352 < 2e-16 ***
## Pop_under_20_Pop_tot 8.852831 8.258840 1.072 0.285
## Studenti 0.014127 0.002546 5.549 1.18e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.213 on 159 degrees of freedom
## Multiple R-squared: 0.906, Adjusted R-squared: 0.9024
## F-statistic: 255.3 on 6 and 159 DF, p-value: < 2.2e-16
With the step function it is possible to simplify this model considering only those features with high significance, excluding the features with no significance (i.e. Pop_under_20_Pop_tot).
# Searching for a simplified model where every feature has high significance
LinearMean = step(LinearMean)
## Start: AIC=270.54
## Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola +
## Pop_under_20_Pop_tot + Studenti
##
## Df Sum of Sq RSS AIC
## - Pop_under_20_Pop_tot 1 5.63 784.17 269.74
## <none> 778.54 270.54
## - Studenti 1 150.77 929.31 297.93
## - Scuole 1 192.37 970.91 305.20
## - Classi 1 272.65 1051.20 318.39
## - Media_stud_scuola 1 524.68 1303.22 354.06
## - Stud_Pop_under_20 1 1310.68 2089.22 432.41
##
## Step: AIC=269.74
## Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola +
## Studenti
##
## Df Sum of Sq RSS AIC
## <none> 784.17 269.74
## - Studenti 1 147.24 931.41 296.30
## - Scuole 1 200.34 984.51 305.51
## - Classi 1 269.92 1054.09 316.84
## - Media_stud_scuola 1 638.28 1422.44 366.59
## - Stud_Pop_under_20 1 1306.95 2091.12 430.56
summary(LinearMean)
##
## Call:
## lm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole +
## Classi + Media_stud_scuola + Studenti, data = tn, na.action = na.ignore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5592 -1.1207 0.0051 1.2213 7.7849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.442776 0.393632 16.368 < 2e-16 ***
## Stud_Pop_under_20 6.908668 0.423067 16.330 < 2e-16 ***
## Scuole 0.853246 0.133455 6.394 1.70e-09 ***
## Classi -0.374531 0.050468 -7.421 6.46e-12 ***
## Media_stud_scuola 0.042152 0.003694 11.412 < 2e-16 ***
## Studenti 0.013921 0.002540 5.481 1.61e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.214 on 160 degrees of freedom
## Multiple R-squared: 0.9053, Adjusted R-squared: 0.9023
## F-statistic: 305.9 on 5 and 160 DF, p-value: < 2.2e-16
The plot of the studentized residuals of the linear model can give us a hint about the presence of spatial dependence in the residuals. In fact, some similarities may be found in the east part, near the border and within the Adamello Brenta Park.
par(mar=c(0,0,1,0))
studres <- rstudent(LinearMean)
resdistr <- round(quantile(studres, seq(0,1,0.25)), digits=3)
colours_5 <- colours(5)
plot(tn, col=colours_5[findInterval(studres, resdistr, all.inside=TRUE)],
main = "Residuals quantiles in Trentino",
border="grey20")
The command that allows to perform the Moran’s I test in the OLS
residuals is the function lm.morantest(). In the following
chunk, the test to the studentized residuals of the linear Solow model,
for different specifications of the spatial weights matrix. This method
will be applied to all neighbourhoods definition, except KNN with \(k\) lower than 4 and the contiguity
neighbourhood, since they return unusual results.
ols_res = data.frame(Neighbourhood = c(""),
Moran = c(""),
p_value = c(""))
# Moran test on residuals
for(i in 4:11){
t = lm.morantest(LinearMean,weights[[i]][[1]],resfun=rstudent)
ols_res = rbind(ols_res, c(weights[[i]][[2]],
t$estimate['Observed Moran I'],
t$p.value))
}
ols_res$Moran = as.numeric(ols_res$Moran)
ols_res$p_value = as.numeric(ols_res$p_value)
ols_res %>%
arrange(desc(abs(Moran))) %>%
filter(!is.na(Moran)) %>%
DT::datatable()
The obtained results provide a negative spatial autocorrelation, but the p-value is far from being below to the threshold to confirm this hypothesis. Therefore, no spatial autocorrelation is found in the residuals of this model.
By focusing instead on the population under 20 over the total population for each municipality, we obtain from the summary that only the mean students per school is highly significative, while the students over population under 20’s p-value is slightly above \(0.05\).
LinearPop <- lm(tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20+Scuole+Classi+Media_stud_scuola+Media_stud_classe+Studenti, tn)
summary(LinearPop)
##
## Call:
## lm(formula = tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole +
## Classi + Media_stud_scuola + Media_stud_classe + Studenti,
## data = tn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.059060 -0.011379 0.002452 0.010725 0.045638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.778e-01 9.719e-03 18.296 <2e-16 ***
## Stud_Pop_under_20 -1.154e-02 6.268e-03 -1.842 0.0679 .
## Scuole 9.683e-04 1.315e-03 0.736 0.4629
## Classi 5.377e-04 4.986e-04 1.078 0.2829
## Media_stud_scuola 1.387e-04 4.950e-05 2.802 0.0059 **
## Media_stud_classe 5.846e-04 9.422e-04 0.620 0.5361
## Studenti -3.461e-05 2.351e-05 -1.472 0.1435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01878 on 124 degrees of freedom
## (35 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared: 0.184, Adjusted R-squared: 0.1445
## F-statistic: 4.659 on 6 and 124 DF, p-value: 0.0002638
By applying the step function, the mean of students per class and the number of classes are erased. However, the adjusted r-squared is too low to guarantee the quality of this model.
# Searching for a simplified model where every feature has high significance
LinearPop = step(LinearPop)
## Start: AIC=-1034.61
## tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Classi +
## Media_stud_scuola + Media_stud_classe + Studenti
##
## Df Sum of Sq RSS AIC
## - Media_stud_classe 1 0.00013579 0.043877 -1036.2
## - Scuole 1 0.00019128 0.043933 -1036.0
## - Classi 1 0.00041026 0.044152 -1035.4
## <none> 0.043742 -1034.6
## - Studenti 1 0.00076467 0.044506 -1034.3
## - Stud_Pop_under_20 1 0.00119657 0.044938 -1033.1
## - Media_stud_scuola 1 0.00276903 0.046511 -1028.6
##
## Step: AIC=-1036.2
## tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Classi +
## Media_stud_scuola + Studenti
##
## Df Sum of Sq RSS AIC
## - Classi 1 0.0002938 0.044171 -1037.3
## - Scuole 1 0.0004336 0.044311 -1036.9
## - Studenti 1 0.0006498 0.044527 -1036.3
## <none> 0.043877 -1036.2
## - Stud_Pop_under_20 1 0.0010645 0.044942 -1035.1
## - Media_stud_scuola 1 0.0094505 0.053328 -1012.6
##
## Step: AIC=-1037.33
## tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola +
## Studenti
##
## Df Sum of Sq RSS AIC
## <none> 0.044171 -1037.3
## - Scuole 1 0.0007497 0.044921 -1037.1
## - Stud_Pop_under_20 1 0.0007708 0.044942 -1037.1
## - Studenti 1 0.0009575 0.045129 -1036.5
## - Media_stud_scuola 1 0.0092138 0.053385 -1014.5
summary(LinearPop)
##
## Call:
## lm(formula = tn$Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole +
## Media_stud_scuola + Studenti, data = tn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.059266 -0.011806 0.002099 0.010582 0.046054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.826e-01 4.277e-03 42.690 < 2e-16 ***
## Stud_Pop_under_20 -7.607e-03 5.130e-03 -1.483 0.141
## Scuole 1.649e-03 1.128e-03 1.462 0.146
## Media_stud_scuola 1.596e-04 3.113e-05 5.127 1.08e-06 ***
## Studenti -1.100e-05 6.658e-06 -1.653 0.101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01872 on 126 degrees of freedom
## (35 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared: 0.1759, Adjusted R-squared: 0.1498
## F-statistic: 6.725 on 4 and 126 DF, p-value: 6.134e-05
ols_res = data.frame(Neighbourhood = c(""),
Moran = c(""),
p_value = c(""))
# Moran test on residuals
for(i in 4:11){
t = lm.morantest(LinearPop,weights[[i]][[1]],resfun=rstudent)
ols_res = rbind(ols_res, c(weights[[i]][[2]],
t$estimate['Observed Moran I'],
t$p.value))
}
ols_res$Moran = as.numeric(ols_res$Moran)
ols_res$p_value = as.numeric(ols_res$p_value)
ols_res %>%
arrange(desc(abs(Moran))) %>%
filter(!is.na(Moran)) %>%
DT::datatable()
In this case instead, we can pay attention to the p-value, which is above \(0.05\) in all cases, except for knn with \(k=5\). This may indicate the presence of spatial autocorrelation in the residuals and consequently that the model is misspecified. Equivalent results have been obtained by rotating and discarding some predictors.
If spatial autocorrelation is present it will violate the assumption about the independence of residuals and call into question the validity of hypothesis testing6, leading to the rejection of the Population under 20 over the total population. However, since the spatial autocorrelation is found only in a single neighbourhood, it will still be studied in the next sections.
By looking at the plot we can see how the residuals in the most populated areas, as Trento and Rovereto, are low (i.e. yellow), while smaller municipalities around are red (i.e. high residuals).
par(mar=c(0,0,1,0))
studres <- rstudent(LinearPop)
resdistr <- round(quantile(studres, seq(0,1,0.25)), digits=3)
plot(tn, col=colours_5[findInterval(studres, resdistr, all.inside=TRUE)],
main = "Residuals quantiles in Trentino")
The following grid of images shows the Moran’s Scatterplot within all different notions of neighborhoods. The membership of municipalities inside a quadrant or another changes according to the neighbourhood definition. For instance, Novaledo often switches from HH to HL quadrant and Valfloriana from LH to LL and viceversa. Also, since this feature is a proportion that goes from \(0.10\) to \(0.24\), many municipalities overlap or assume the same value (columns of dots). Nevertheless, some outliers are noticeable, represented with a different style and their label.
As stated forehead, the quadrants with the pink line and its confidence interval shows the municipalities with similar proportion of students population over the total one (i.e. positive spatial autocorrelation). An example is the relationship between Novaledo and Vignola-Falesina, which are usually in the same HH quadrant, have a high proportion of students and they are close, geographically talking. On the opposite side, Cinte Tesino and Castello Tesino show a low value for students over the population.
On the other hand, municipalities on the remaining quadrants, have dissimilar values if they belong to opposite quadrants. For example, Fierozzo and Frassilongo share a boundary, but their proportion of students is \(0.22\) and \(0.15\) respectively, enhancing a big dissimilarity between these two municipalities.
This chunk shows how many municipalities belong to a specific quadrant, based on the different concepts of neighbourhood. It seems like 15 or less over 166 municipalities have a spatial autocorrelation and belong to a specific quadrant. These hotspots are sequently visualized with a colour, whose association is related to the membership to a specific quadrant.
Since the membership to quadrants is divided over the three different concepts of neighbourhood, we can observe that the critical cut-off neighbourhood has no hotspot in the HH quadrant, which represents the majority of municipalities for the contiguity approach.
## quadrants.knn
## HH HL LH LL None
## 1 3 2 5 155
## quadrants.dnb
## HL LH LL None
## 4 3 7 152
## quadrants.cont
## HH HL LH LL None
## 6 2 3 4 151
In order to better visualize them, the plot is served, showing in white the municipalities with no quadrant and the one inside the Moran’s quadrants with a gradient. Moreover, three of them are generated, one for each notion of neighbourhood, choosing a random \(k\) or \(d\) for knn and critical cut-off. This choice is due to the fact that different neighbourhoods lead to different hotspots, i.e. the points that emerge over others in the Moran’s scatterplot.
⚠️ It is suggested to open the image in a new window to better read municipalities labels, whose size has been reduced for readability issues.
par(mfrow=c(3,1))
par(mar=c(0,0,0,0))
quadrants = list(list(quadrants.knn,"KNN"), list(quadrants.dnb,"Cut-off"), list(quadrants.cont,"Contiguity"))
for(l in quadrants){
colourization = unlist(color_mapping[l[[1]]])
par(mar=c(0,0,1,0))
plot(tn, col=colourization, border = "grey", main=paste0("Regions with influence on students over population under 20 (neighbourhood = ",l[[2]],")"))
legend(x=11.38, y=45.95, legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
fill=unlist(color_mapping), bty="n", cex=0.8)
text(tn_coords, labels = ifelse(l[[1]]=="None", "" ,tn@data$Comune), cex=0.7)
}
As can be noticed, while critical cut-off and knn focus on the same eastern area of Trentino, the contiguity approach selects municipalities all over the province, in particular Trento and Rovereto, the most populated ones, and the least ones in the border.
While knn and contiguity show Novaledo as HH, KNN and Cut-off also show Frassilongo, Valfloriana, Castello Tesino and Cinte Tesino (and others not always highlighted before), described above as recurring ones in the not HH quadrants.
# Municipalities in common between KNN and cut-off
tn$Comune[quadrants.knn != "None" & quadrants.dnb != "None"]
## [1] "Castello Tesino" "Cinte Tesino" "Imer" "Mezzano"
## [5] "Palù Del Fersina" "Ruffrè-Mendola" "Terragnolo" "Valfloriana"
# Novaledo in common in contiguity and knn
tn$Comune[quadrants.cont != "None" & quadrants.knn != "None"]
## [1] "Novaledo"
# Nothing in common between contiguity and cut-off
tn$Comune[quadrants.cont != "None" & quadrants.dnb != "None"]
## character(0)
Local statistics can be tested for deviations using the hypothesis of
absence of local spatial autocorrelation and hence can provide the
statistical significance of the local spatial patterns detected by the
Moran scatterplot. In particular, the function localmoran()
provides:
By sorting results according to the municipality with highest absolute value of the local Moran statistic \(I_i\) and by filtering those whose p-value is below \(0.05\), we obtain the areas with highest statistical significance of the local spatial patterns, in a clearer way than the Moran scatterplot. Specifically, Sagron Mis, Cinte Tesino and Castello Tesino seem to be the municipalities with highest Local Moran’s, but comparing the top municipalities in lmI with those in the quadrants computed through the scatterplot, we may notice that only 41% of municipalities identified through the plot are spatially significant in the Local Moran’s.
lmI <- localmoran(tn$Pop_under_20_Pop_tot, dnb24.list,
na.action = na.exclude)
lmI = data.frame(lmI)
rownames(lmI) = tn$Comune
lmI_sign = lmI%>%
filter(lmI$Pr.z....E.Ii..<0.05) %>%
arrange(desc(abs(Ii)))
DT::datatable(lmI_sign)
# Proportion of significant municipalities in Moran's Scatterplot
# versus those identified through local Moran's
sum(c(tn$Comune[quadrants.knn != "None" |
quadrants.dnb != "None" |
quadrants.cont != "None"]) %in% rownames(lmI_sign))/dim(lmI_sign)[1]
## [1] 0.4054054
The following two plots show:
par(mar=c(0,0,1,0))
brks <- sort(as.numeric(lmI[,1]))
colours <- colorRampPalette(c('#fedb71','#dd696d'))(length(lmI[,1]))
plot(tn, col=colours[findInterval(lmI[,1], brks, all.inside=TRUE)],
border="grey30")
title(main="Local Moran's I values")
text(tn_coords, labels = ifelse(tn@data$Comune %in% rownames(lmI_sign), tn@data$Comune ,""), cex=0.7)
# Mapping the p-value as color
pvalue_colors = c("white","#fedb71", "#F6BF70", "#E6866E","#DD696D")
pval <- as.numeric(lmI[,5])
colpval = ifelse(pval>0.05, pvalue_colors[1],
ifelse(pval>0.01 & pval<=0.05, pvalue_colors[2],
ifelse(pval>0.001 & pval<=0.01,pvalue_colors[3],
ifelse(pval>0.0001,pvalue_colors[4],pvalue_colors[5]))))
tn$colpval[pval>0.05] <- "white"
tn$colpval[pval<=0.05 & pval>0.01] <- gray(0.9)
tn$colpval[pval<=0.01 & pval>0.001] <- gray(0.7)
tn$colpval[pval<=0.001 & pval>0.0001] <- gray(0.4)
tn$colpval[pval<=0.0001] <- "black"
plot(tn, col=colpval)
legend(x=11.5, y=45.9, legend=c("Not significant",
"p-value = 0.05", "p-value = 0.01", "p-value = 0.001",
"p-value = 0.0001"), fill=pvalue_colors, bty="n", cex=1)
title(main="Local Moran's I significance map")
text(tn_coords, labels = ifelse(colpval == "white","", tn@data$Comune), cex=0.7)
Comparing the results obtained with the Scatterplots’, we can state that while the critical cut-off found significance in the eastern part (e.g. Primiero, Predazzo, Castello Tesino etc), the contiguity approach helped to find significance near Riva del Garda and around Trento.
The following grid of images shows the Moran’s Scatterplot within all different notions of neighborhoods, considering the mean of students per school. As in the previous case, the pink line has a positive slope, whose confidence interval continues to grow by increasing the \(x\) value. Still, some differences can be noticed from a neighbourhood to another and some municipalities, like Mezzocorona, Villa Lagarina and San Michele All’Adige, continues to switch from above to below the mean dashed line.
In HH quadrant, Trento, Rovereto, Mori, Mezzocorona, Avio, Lavis and Ala seem to be the municipalities with more students per school in the majority of plots, but also Brentonico, Giovo and Nago-Torbole, except when considering the critical cut-off neighbourhood.
On the LH part, communities like Drena (surrounded by Arco, Dro and Cavedine), Fornace (Baselga di Pinè and Civezzano), Ronzo-Chienis, Vallarsa (Ala, Rovereto) and Terragnolo (Rovereto, Folgaria) are those municipalities that have few students (less than \(100\) per school), but surrounded by municipalities with a lot of them.
This chunk shows how many municipalities belong to a specific quadrant, based on the different concepts of neighbourhood. It seems like 16 or less over 166 municipalities have a spatial autocorrelation and belong to a specific quadrant. In particular, both knn and cut-off neighbourhoods do not provide hotspots for the LL quadrant, while it exceeds with HH municipalities, compared to contiguity neighbourhood.
table(quadrants.knn)
## quadrants.knn
## HH HL LH None
## 12 1 2 151
table(quadrants.cont)
## quadrants.cont
## HH HL LH LL None
## 3 2 6 5 150
table(quadrants.dnb)
## quadrants.dnb
## HH HL LH None
## 6 2 3 155
The map displays the membership of municipalities to each of the quadrant in the Moran’s scatterplot, while in white leaving the remaining territories with no quadrant.
Starting with KNN version, the Adige valley is completely in red, showing the spatial similarity of high number of students per school: from Mezzocorona, to Trento, continuing to the neighbourhood of Rovereto and the bottom-center part of Trentino.
The cut-off neighbourhood instead considers partially the same territories of KNN, with less focus on Rovereto and more on some LH municipalities, as Vallarsa, Trambileno and Terragnolo.
In the end, the contiguity approach detaches itself from previous approaches to show apparently random municipalities, some in common with knn and cut-off (e.g. Roverè della Luna, Lavis, Trambileno, Nago-Torbole and Mori).
par(mfrow=c(3,1))
par(mar=c(0,0,0,0))
quadrants = list(list(quadrants.knn,"KNN"), list(quadrants.dnb,"Cut-off"), list(quadrants.cont,"Contiguity"))
for(l in quadrants){
colourization = unlist(color_mapping[l[[1]]])
par(mar=c(0,0,1,0))
plot(tn, col=colourization, border = "grey", main=paste0("Regions with influence on mean number of students per school (neighbourhood = ",l[[2]],")"))
legend(x=11.38, y=45.95, legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
fill=unlist(color_mapping), bty="n", cex=0.8)
text(tn_coords, labels = ifelse(l[[1]]=="None", "" ,tn@data$Comune), cex=0.7)
}
# Municipalities in common between KNN and cut-off
tn$Comune[quadrants.knn != "None" & quadrants.dnb != "None"]
## [1] "Avio" "Lavis" "Levico Terme"
## [4] "Mezzocorona" "Mori" "San Michele All'Adige"
## [7] "Trento" "Villa Lagarina"
# Novaledo in common in contiguity and knn
tn$Comune[quadrants.cont != "None" & quadrants.knn != "None"]
## [1] "Lavis" "Mori" "Nago-Torbole"
## [4] "Roverè Della Luna"
# Nothing in common between contiguity and cut-off
tn$Comune[quadrants.cont != "None" & quadrants.dnb != "None"]
## [1] "Lavis" "Mori" "Trambileno"
This time, we will focus on the mean of students per school, whose values comprehend also NAs, which will be excluded. As before, we create a dataframe ordered by the absolute value of Local Moran’s \(I_i\), filtered by the p-value below \(0.05\).
lmI <- localmoran(tn$Media_stud_scuola, dnb24.list, na.action = na.exclude)
lmI = data.frame(lmI)
rownames(lmI) = tn$Comune
lmI_sign = lmI%>%
filter(lmI$Pr.z....E.Ii..<0.05) %>%
arrange(desc(abs(Ii)))
DT::datatable(lmI_sign)
# Proportion of significant municipalities in Moran's Scatterplot
# versus those identified through local Moran's
sum(c(tn$Comune[quadrants.knn != "None" |
quadrants.dnb != "None" |
quadrants.cont != "None"]) %in% rownames(lmI_sign))/dim(lmI_sign)[1]
## [1] 0.2926829
41 municipalities find a significant statistic for the local spatial autocorrelation and just \(29\%\) are in common with the ones found in the Moran scatterplots’ quadrants.
The following two plots show:
par(mar=c(0,0,1,0))
brks <- sort(as.numeric(lmI[,1]))
colours <- colorRampPalette(c('#fedb71','#dd696d'))(length(lmI[,1]))
plot(tn, col=colours[findInterval(lmI[,1], brks, all.inside=TRUE)],
border="grey30")
title(main="Local Moran's I values")
text(tn_coords, labels = ifelse(tn@data$Comune %in% rownames(lmI_sign), tn@data$Comune ,""), cex=0.7)
# Mapping the p-value as color
pvalue_colors = c("white","#fedb71", "#F6BF70", "#E6866E","#DD696D")
pval <- as.numeric(lmI[,5])
colpval = ifelse(pval>0.05, pvalue_colors[1],
ifelse(pval>0.01 & pval<=0.05, pvalue_colors[2],
ifelse(pval>0.001 & pval<=0.01,pvalue_colors[3],
ifelse(pval>0.0001,pvalue_colors[4],pvalue_colors[5]))))
tn$colpval[pval>0.05] <- "white"
tn$colpval[pval<=0.05 & pval>0.01] <- gray(0.9)
tn$colpval[pval<=0.01 & pval>0.001] <- gray(0.7)
tn$colpval[pval<=0.001 & pval>0.0001] <- gray(0.4)
tn$colpval[pval<=0.0001] <- "black"
plot(tn, col=colpval)
legend(x=11.5, y=45.9, legend=c("Not significant",
"p-value = 0.05", "p-value = 0.01", "p-value = 0.001",
"p-value = 0.0001"), fill=pvalue_colors, bty="n", cex=1)
title(main="Local Moran's I significance map")
text(tn_coords, labels = ifelse(colpval == "white","", tn@data$Comune), cex=0.6)
These results deviate from those discovered through the scatterplot, since here two big clusters are evident, while in the plots with all neighbourhood definitions territories with spatial autocorrelation are quite sparse. Nevertheless, some territories are in common, especially in Rovereto area.
This last inquiry is due to the great gap discovered in the choropleth map inside the website that shows the proportion of actual students over the total population under 20 years old of every municipality. It seems in fact that some municipalities host more students than those who actually live in that area, leading to the possible conclusion that students may need to move from their city to another to go to school every day.
The remaining grid shows this time a negative trend of the relationship between x and the spatially lagged x, indicating that there are few close municipalities with a high number of students over the young population, but mostly of them share dissimilar values with their neighbours. In fact, this may be due to the problem highlighted before, related to the movement of students across different municipalities to go to school, especially middle, high and professional schools.
By assigning a quadrant to each municipality we can easily access to the number and the names of municipalities where students come from other areas or that have the necessity of moving to go to school. In fact, as shown in the assignation of numbers to quadrants, the most populated areas are HL and LH, showing a trend of negative spatial autocorrelation.
table(quadrants.knn)
## quadrants.knn
## HL LH LL None
## 6 4 2 154
table(quadrants.cont)
## quadrants.cont
## HH HL LH None
## 4 7 2 153
table(quadrants.dnb)
## quadrants.dnb
## HH HL LH None
## 2 5 2 157
In this case the color mapping has been changed to highlight the quadrants with most of the municipalities, whose values are dissimilar (i.e. HL in red, LH in yellow).
As in the previous two analysis of the scatterplot, knn and cut-off show more similarities than with the contiguity approach, thus pointing out the municipalities that host more students than those who actually live inside the territory (i.e. they welcome students from their neighbours).
The most critical situation is around Tione di Trento, which has a high number of students over its under 20 population, while Valdaone, Selle Giudicarie, Porte di Rendena and Pelugo lack of students. Notice however that we do not hold data about the number of students in Valdaone and most of them are missing in those zones.
Something similar but limited happens between Canazei and Giovanni di Fassa, since their closeness and the gap of students in the first against the aboundance in the second may imply that students in Canazei move to Giovanni di Fassa to go to school.
color_mapping = list("LL" = "#F6Bf70",
"LH" = "#FEDB71",
"HL" = "#E0716E",
"HH" = "#E6866E",
"None" = "white")
par(mfrow=c(3,1))
par(mar=c(0,0,0,0))
quadrants = list(list(quadrants.knn,"KNN"),
list(quadrants.dnb,"Cut-off"),
list(quadrants.cont,"Contiguity"))
for(l in quadrants){
colourization = unlist(color_mapping[l[[1]]])
par(mar=c(0,0,1,0))
plot(tn, col=colourization, border = "grey",
main=paste0("Regions with influence on students over population under 20 (neighbourhood = ",l[[2]],")"))
legend(x=11.38, y=45.95,
legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
fill=unlist(color_mapping), bty="n", cex=0.8)
text(tn_coords, labels = ifelse(l[[1]]=="None", "" ,tn@data$Comune), cex=0.7)
}
# Municipalities in common between KNN and cut-off
# (those that for sure host students from their neighbours)
tn$Comune[quadrants.knn != "None" & quadrants.dnb != "None"]
## [1] "Bondone" "Cles" "Ossana" "Tione Di Trento"
# Cinte Tesino in common in contiguity and knn
tn$Comune[quadrants.cont != "None" & quadrants.knn != "None"]
## [1] "Cinte Tesino"
# Nothing in common between contiguity and cut-off
tn$Comune[quadrants.cont != "None" & quadrants.dnb != "None"]
## character(0)
As in the previous section, also here not all municipalities detain data about students over the population under 20, therefore missing values are excluded. Since there are only few municipalities whose p-value is below \(0.05\) (i.e. Molveno, Cimone, Aldeno and Storo), the dataframe below accepts all rows whose p-value is below \(0.1\).
lmI <- localmoran(tn$Stud_Pop_under_20, dnb24.list,
na.action = na.exclude)
lmI = data.frame(lmI)
rownames(lmI) = tn$Comune
lmI_sign = lmI%>%
filter(lmI$Pr.z....E.Ii..<0.1) %>%
arrange(desc(abs(Ii)))
DT::datatable(lmI_sign)
# Proportion of significant municipalities in Moran's Scatterplot
# versus those identified through local Moran's
sum(c(tn$Comune[quadrants.knn != "None" |
quadrants.dnb != "None" |
quadrants.cont != "None"]) %in% rownames(lmI_sign))/dim(lmI_sign)[1]
## [1] 0.2352941
17 municipalities find a significant statistic for the local spatial autocorrelation and just \(24\%\) are in common with the ones found in the Moran scatterplots’ quadrants.
The following two plots show:
par(mar=c(0,0,1,0))
brks <- sort(as.numeric(lmI[,1]))
colours <- colorRampPalette(c('#fedb71','#dd696d'))(length(lmI[,1]))
plot(tn, col=colours[findInterval(lmI[,1], brks, all.inside=TRUE)],
border="grey30")
title(main="Local Moran's I values")
text(tn_coords, labels = ifelse(lmI$Ii>=0.2 | lmI$Ii<=-0.2, tn@data$Comune ,""), cex=0.7)
# Mapping the p-value as color
pvalue_colors = c("white","#fedb71", "#F6BF70", "#E6866E","#DD696D")
pval <- as.numeric(lmI[,5])
colpval = ifelse(pval>0.05, pvalue_colors[1],
ifelse(pval>0.01 & pval<=0.05, pvalue_colors[2],
ifelse(pval>0.001 & pval<=0.01,pvalue_colors[3],
ifelse(pval>0.0001,pvalue_colors[4],pvalue_colors[5]))))
tn$colpval[pval>0.05] <- "white"
tn$colpval[pval<=0.05 & pval>0.01] <- gray(0.9)
tn$colpval[pval<=0.01 & pval>0.001] <- gray(0.7)
tn$colpval[pval<=0.001 & pval>0.0001] <- gray(0.4)
tn$colpval[pval<=0.0001] <- "black"
plot(tn, col=colpval)
legend(x=11.5, y=45.9, legend=c("Not significant",
"p-value = 0.05", "p-value = 0.01", "p-value = 0.001",
"p-value = 0.0001"), fill=pvalue_colors, bty="n", cex=1)
title(main="Local Moran's I significance map")
text(tn_coords, labels = ifelse(colpval == "white","", tn@data$Comune), cex=0.6)
Let’s start with the SDM model. The lagsarlm()9 function
provides Maximum likelihood estimation of spatial simultaneous
autoregressive lag and spatial Durbin (mixed) models of the form:
\[ Y = \rho Wy+X\beta+\epsilon \]
As shown in the summary, only the mean of students per school is a relevant predictor, none of the lag or the remaining predictors’ p-value is close to the \(0.05\) threshold to demonstrate statistical significance. Also, the overall p-value is above \(0.5\), therefore no significance is shown by the SDM.
library(spatialreg)
# Estimate the SDM model using the Maximum likelihood estimator
SDM <- lagsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti, tn, listw=dnb16.list,
type="mixed", na.action = na.ignore)
summary(SDM)
##
## Call:lagsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 +
## Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list,
## na.action = na.ignore, type = "mixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0822634 -0.0120025 0.0012275 0.0132263 0.0510890
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0941e-01 4.2818e-02 4.8908 1.004e-06
## Stud_Pop_under_20 -3.5086e-03 3.7552e-03 -0.9343 0.3501250
## Scuole 1.0641e-03 1.2364e-03 0.8606 0.3894323
## Media_stud_scuola 1.3926e-04 3.6987e-05 3.7652 0.0001664
## Studenti -7.4425e-06 7.2944e-06 -1.0203 0.3075849
## lag.Stud_Pop_under_20 -1.2982e-02 1.6821e-02 -0.7718 0.4402583
## lag.Scuole -3.3139e-03 4.6572e-03 -0.7116 0.4767314
## lag.Media_stud_scuola 8.7287e-05 1.1580e-04 0.7538 0.4509847
## lag.Studenti 2.6229e-05 2.8213e-05 0.9297 0.3525434
##
## Rho: -0.14384, LR test value: 0.44322, p-value: 0.50557
## Asymptotic standard error: 0.23123
## z-value: -0.62207, p-value: 0.5339
## Wald statistic: 0.38697, p-value: 0.5339
##
## Log likelihood: 409.4587 for mixed model
## ML residual variance (sigma squared): 0.00042124, (sigma: 0.020524)
## Number of observations: 166
## Number of parameters estimated: 11
## AIC: -796.92, (AIC for lm: -798.47)
## LM test for residual autocorrelation
## test value: 0.0993, p-value: 0.75267
Same principle applies to SAR, which has an higher p-value than SDM, indicating the absence of statistical significance. The relevance of predictors remains unaltered.
# Estimate the SAR model using the Maximum likelihood estimator
SAR <- lagsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti, tn, listw=dnb16.list, na.action = na.ignore)
summary(SAR)
##
## Call:lagsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 +
## Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list,
## na.action = na.ignore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0851982 -0.0122352 0.0021982 0.0137631 0.0547369
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.7100e-01 3.6728e-02 4.6558 3.228e-06
## Stud_Pop_under_20 -4.3445e-03 3.6708e-03 -1.1835 0.2366
## Scuole 1.6804e-03 1.2191e-03 1.3784 0.1681
## Media_stud_scuola 1.5913e-04 3.4866e-05 4.5642 5.014e-06
## Studenti -1.1361e-05 7.1638e-06 -1.5858 0.1128
##
## Rho: 0.050613, LR test value: 0.078455, p-value: 0.7794
## Asymptotic standard error: 0.18948
## z-value: 0.26712, p-value: 0.78937
## Wald statistic: 0.071355, p-value: 0.78937
##
## Log likelihood: 407.2789 for lag model
## ML residual variance (sigma squared): 0.0004329, (sigma: 0.020806)
## Number of observations: 166
## Number of parameters estimated: 7
## AIC: -800.56, (AIC for lm: -802.48)
## LM test for residual autocorrelation
## test value: 1.281, p-value: 0.25772
Now we skip to the local spillover specifications, starting from SDEM, which shows no significance in terms of p-value, both of the model and about the predictors.
# Estimate the SDEM model using the Maximum likelihood estimator
SDEM <- errorsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti, tn, listw=dnb16.list,
etype = "emixed", na.action = na.ignore)
summary(SDEM)
##
## Call:errorsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 +
## Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list,
## na.action = na.ignore, etype = "emixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0823709 -0.0118217 0.0011253 0.0132666 0.0510749
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8366e-01 9.5850e-03 19.1616 < 2.2e-16
## Stud_Pop_under_20 -3.4720e-03 3.7524e-03 -0.9253 0.3548242
## Scuole 1.1078e-03 1.2315e-03 0.8995 0.3683669
## Media_stud_scuola 1.3904e-04 3.7117e-05 3.7459 0.0001797
## Studenti -7.7213e-06 7.2580e-06 -1.0638 0.2874074
## lag.Stud_Pop_under_20 -1.1487e-02 1.5701e-02 -0.7316 0.4644119
## lag.Scuole -3.3237e-03 4.3966e-03 -0.7560 0.4496691
## lag.Media_stud_scuola 5.3972e-05 1.0057e-04 0.5366 0.5915187
## lag.Studenti 2.6360e-05 2.6723e-05 0.9864 0.3239349
##
## Lambda: -0.14789, LR test value: 0.44493, p-value: 0.50475
## Asymptotic standard error: 0.2341
## z-value: -0.63174, p-value: 0.52756
## Wald statistic: 0.3991, p-value: 0.52756
##
## Log likelihood: 409.4596 for error model
## ML residual variance (sigma squared): 0.0004212, (sigma: 0.020523)
## Number of observations: 166
## Number of parameters estimated: 11
## AIC: -796.92, (AIC for lm: -798.47)
The same happens for the SEM.
# Estimate the SEM model using the Maximum likelihood estimator
SEM <- errorsarlm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti, tn, listw=dnb16.list, na.action = na.ignore)
summary(SEM)
##
## Call:errorsarlm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 +
## Scuole + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list,
## na.action = na.ignore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0849250 -0.0123802 0.0025571 0.0132241 0.0550477
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8066e-01 3.6292e-03 49.7798 < 2.2e-16
## Stud_Pop_under_20 -4.5948e-03 3.6494e-03 -1.2590 0.2080
## Scuole 1.6581e-03 1.2192e-03 1.3600 0.1738
## Media_stud_scuola 1.6298e-04 3.4221e-05 4.7625 1.912e-06
## Studenti -1.1209e-05 7.1655e-06 -1.5643 0.1177
##
## Lambda: -0.043474, LR test value: 0.040879, p-value: 0.83977
## Asymptotic standard error: 0.22263
## z-value: -0.19528, p-value: 0.84518
## Wald statistic: 0.038133, p-value: 0.84518
##
## Log likelihood: 407.2601 for error model
## ML residual variance (sigma squared): 0.00043301, (sigma: 0.020809)
## Number of observations: 166
## Number of parameters estimated: 7
## AIC: -800.52, (AIC for lm: -802.48)
On the other hand, the LDM shows the same relevance for predictors, but a low high-value, demonstrating the significance of this estimation.
# Estimate the LDM model using the OLS estimator
LDM <- lmSLX(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti,tn, listw=dnb16.list, na.action = na.ignore)
summary(LDM)
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.082928 -0.011458 0.001065 0.013532 0.051027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.832e-01 1.060e-02 17.286 < 2e-16 ***
## Stud_Pop_under_20 -3.402e-03 3.868e-03 -0.879 0.380477
## Scuole 1.150e-03 1.272e-03 0.904 0.367383
## Media_stud_scuola 1.383e-04 3.810e-05 3.630 0.000383 ***
## Studenti -7.986e-06 7.500e-06 -1.065 0.288594
## lag.Stud_Pop_under_20 -1.118e-02 1.716e-02 -0.651 0.515795
## lag.Scuole -3.319e-03 4.798e-03 -0.692 0.490169
## lag.Media_stud_scuola 6.130e-05 1.078e-04 0.568 0.570514
## lag.Studenti 2.585e-05 2.907e-05 0.889 0.375166
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02115 on 157 degrees of freedom
## Multiple R-squared: 0.1814, Adjusted R-squared: 0.1396
## F-statistic: 4.348 on 8 and 157 DF, p-value: 9.297e-05
The calculation of impacts for spatial lag and spatial Durbin models
is needed in order to interpret the regression coefficients correctly,
because of the spillovers between the terms in these data generation
processes[spdep Documentation, https://www.rdocumentation.org/packages/spdep/versions/1.0-2/topics/impacts][LeSage
J and RK Pace (2009) Introduction to Spatial Econometrics. CRC Press,
Boca Raton, pp. 33–42, 114–115; LeSage J and MM Fischer (2008) Spatial
growth regressions: model specification, estimation and interpretation.
Spatial Economic Analysis 3 (3), pp. 275–304]10. The method
impacts returns the average direct, indirect and total
impacts for the variables in the model, for the variables themselves in
the spatial lag model case, for the variables and their spatial lags in
the spatial Durbin (mixed) model case.
The Lagrange multiplier (LM) test of spatial dependence on OLS residuals. In the LM test the alternative hypothesis is explicitly considered to contrast the null #of absence of spatial dependence. In particular, we can explicitly express the alternative hypothesis either in the form of a SL or of a SEM.
The function lm.LMtests() reports the estimates of tests
chosen among 4 statistics for testing for spatial dependence in linear
models, which are:
According to this diagnostic, no model has been found significant by looking at the p-value.
OLSmodel <- lm(Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole + Media_stud_scuola + Studenti, data = tn, na.action = na.ignore)
natOLSlmTests <- lm.LMtests(OLSmodel, dnb16.list,
test=c("LMerr", "LMlag", "RLMerr", "RLMlag"))
summary(natOLSlmTests)
## Lagrange multiplier diagnostics for spatial dependence
## data:
## model: lm(formula = Pop_under_20_Pop_tot ~ Stud_Pop_under_20 + Scuole +
## Media_stud_scuola + Studenti, data = tn, na.action = na.ignore)
## weights: dnb16.list
##
## statistic parameter p.value
## LMerr 0.042152 1 0.8373
## LMlag 0.090537 1 0.7635
## RLMerr 1.141962 1 0.2852
## RLMlag 1.190346 1 0.2753
In conclusion, a positive significant direct impact on the proportion of population under 20 over the total population has been found through the mean number of students per school, implying that higher levels of potential students in the territory lead to the increase of mean students per school in the same neighbourhood.
Again, we can repeat the same analysis with the mean of students per class, using the same predictors used in the OLS residuals section (i.e. optimal model).
Starting from SDM, all predictors seem to gain high relevance given the low p-value, while their lagged values are relevant only for the mean students per school and slightly for the number of classes. The overall p-value of the model is below \(0.05\) and therefore considered statistically significant.
SDM <- lagsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + Studenti, tn, listw=dnb16.list,
type="mixed", na.action = na.ignore)
summary(SDM)
##
## Call:lagsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole +
## Classi + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list,
## na.action = na.ignore, type = "mixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.17522 -1.30828 0.03293 1.10682 7.30820
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.3716051 1.8391923 4.5518 5.319e-06
## Stud_Pop_under_20 7.0792138 0.4082789 17.3392 < 2.2e-16
## Scuole 0.8148910 0.1296859 6.2836 3.309e-10
## Classi -0.3689308 0.0487037 -7.5750 3.597e-14
## Media_stud_scuola 0.0392220 0.0037829 10.3683 < 2.2e-16
## Studenti 0.0138605 0.0024529 5.6507 1.598e-08
## lag.Stud_Pop_under_20 3.2655810 2.4964221 1.3081 0.19084
## lag.Scuole 1.0207311 0.5441222 1.8759 0.06067
## lag.Classi -0.5003442 0.2590131 -1.9317 0.05339
## lag.Media_stud_scuola 0.0408093 0.0161537 2.5263 0.01153
## lag.Studenti 0.0200382 0.0125349 1.5986 0.10991
##
## Rho: -0.55537, LR test value: 4.4195, p-value: 0.035531
## Asymptotic standard error: 0.26076
## z-value: -2.1298, p-value: 0.033185
## Wald statistic: 4.5362, p-value: 0.033185
##
## Log likelihood: -358.8017 for mixed model
## ML residual variance (sigma squared): 4.3451, (sigma: 2.0845)
## Number of observations: 166
## Number of parameters estimated: 13
## AIC: 743.6, (AIC for lm: 746.02)
## LM test for residual autocorrelation
## test value: 5.8818, p-value: 0.015298
In SAR case instead, despite all predictors seem to have high relevance, the spatial autoregressive parameter \(\rho\) (Rho) is not significant due to high p-value on both the asymptotic t-test and Likelihood ratio test, but the LM test for residual autocorrelation has low p-value, indicating the presence of spatial autocorrelation in the residuals.
SAR <- lagsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + Studenti, tn, listw=dnb16.list, na.action = na.ignore)
summary(SAR)
##
## Call:lagsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole +
## Classi + Media_stud_scuola + Studenti, data = tn, listw = dnb16.list,
## na.action = na.ignore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.61926 -1.11213 -0.15396 1.17342 7.59740
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.4160302 0.9076190 5.9673 2.412e-09
## Stud_Pop_under_20 7.0058916 0.4183042 16.7483 < 2.2e-16
## Scuole 0.8438018 0.1303428 6.4737 9.562e-11
## Classi -0.3709965 0.0493342 -7.5201 5.485e-14
## Media_stud_scuola 0.0406608 0.0037524 10.8360 < 2.2e-16
## Studenti 0.0137928 0.0024817 5.5578 2.733e-08
##
## Rho: 0.09763, LR test value: 1.81, p-value: 0.17851
## Asymptotic standard error: 0.076863
## z-value: 1.2702, p-value: 0.20402
## Wald statistic: 1.6133, p-value: 0.20402
##
## Log likelihood: -363.5075 for lag model
## ML residual variance (sigma squared): 4.6699, (sigma: 2.161)
## Number of observations: 166
## Number of parameters estimated: 8
## AIC: 743.02, (AIC for lm: 742.83)
## LM test for residual autocorrelation
## test value: 3.9831, p-value: 0.04596
In case of local spillover specifications, all predictors are relevant, but not their lagged version. The \(\lambda\) (Lambda) estimator is highly significant due to low p-value on both Likelihood ratio and asymptotic t-tests.
SDEM <- errorsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + Studenti, tn, listw=dnb16.list,
etype = "emixed", na.action = na.ignore)
summary(SDEM)
##
## Call:errorsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 +
## Scuole + Classi + Media_stud_scuola + Studenti, data = tn,
## listw = dnb16.list, na.action = na.ignore, etype = "emixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.835976 -1.281293 -0.071948 1.124115 7.072121
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.1633884 0.7473589 6.9088 4.886e-12
## Stud_Pop_under_20 7.0991172 0.4126814 17.2024 < 2.2e-16
## Scuole 0.8014391 0.1303753 6.1472 7.888e-10
## Classi -0.3631380 0.0482991 -7.5185 5.529e-14
## Media_stud_scuola 0.0379363 0.0038330 9.8973 < 2.2e-16
## Studenti 0.0136307 0.0024354 5.5970 2.181e-08
## lag.Stud_Pop_under_20 -0.3096908 1.2769796 -0.2425 0.80838
## lag.Scuole 0.4894061 0.3759430 1.3018 0.19298
## lag.Classi -0.3029760 0.2105152 -1.4392 0.15009
## lag.Media_stud_scuola 0.0159678 0.0095859 1.6658 0.09576
## lag.Studenti 0.0130679 0.0103491 1.2627 0.20669
##
## Lambda: -0.80016, LR test value: 7.2953, p-value: 0.0069134
## Asymptotic standard error: 0.27703
## z-value: -2.8884, p-value: 0.0038722
## Wald statistic: 8.3428, p-value: 0.0038722
##
## Log likelihood: -357.3637 for error model
## ML residual variance (sigma squared): 4.2043, (sigma: 2.0504)
## Number of observations: 166
## Number of parameters estimated: 13
## AIC: 740.73, (AIC for lm: 746.02)
In case of SEM instead, the estimator is not statistically significant in both tests due to p-value around \(0.3\).
SEM <- errorsarlm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + Studenti, tn, listw=dnb16.list, na.action = na.ignore)
summary(SEM)
##
## Call:errorsarlm(formula = Media_stud_classe ~ Stud_Pop_under_20 +
## Scuole + Classi + Media_stud_scuola + Studenti, data = tn,
## listw = dnb16.list, na.action = na.ignore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.372522 -1.197946 -0.013962 1.361034 7.809514
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.3045147 0.3614975 17.4400 < 2.2e-16
## Stud_Pop_under_20 6.8364576 0.4094121 16.6982 < 2.2e-16
## Scuole 0.8768414 0.1302784 6.7305 1.691e-11
## Classi -0.3783119 0.0494259 -7.6541 1.954e-14
## Media_stud_scuola 0.0435938 0.0034738 12.5491 < 2.2e-16
## Studenti 0.0139824 0.0024918 5.6113 2.008e-08
##
## Lambda: -0.25333, LR test value: 0.96423, p-value: 0.32612
## Asymptotic standard error: 0.24437
## z-value: -1.0366, p-value: 0.2999
## Wald statistic: 1.0746, p-value: 0.2999
##
## Log likelihood: -363.9304 for error model
## ML residual variance (sigma squared): 4.6797, (sigma: 2.1633)
## Number of observations: 166
## Number of parameters estimated: 8
## AIC: 743.86, (AIC for lm: 742.83)
Also for LDM predictors are relevant, except for their lagged values. The p-value is below the \(0.05\) threshold, concluding the significance of this model.
LDM <- lmSLX(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi + Media_stud_scuola + Studenti,tn, listw=dnb16.list, na.action = na.ignore)
summary(LDM)
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4798 -1.2584 -0.0412 1.0929 7.3924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.290053 1.127564 4.692 5.91e-06 ***
## Stud_Pop_under_20 7.066653 0.431613 16.373 < 2e-16 ***
## Scuole 0.803002 0.137004 5.861 2.67e-08 ***
## Classi -0.372470 0.051483 -7.235 2.02e-11 ***
## Media_stud_scuola 0.039215 0.003993 9.821 < 2e-16 ***
## Studenti 0.014097 0.002593 5.436 2.07e-07 ***
## lag.Stud_Pop_under_20 -0.724507 1.824028 -0.397 0.692
## lag.Scuole 0.492397 0.512888 0.960 0.339
## lag.Classi -0.293354 0.251452 -1.167 0.245
## lag.Media_stud_scuola 0.014574 0.011327 1.287 0.200
## lag.Studenti 0.012448 0.012624 0.986 0.326
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.204 on 155 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9032
## F-statistic: 155 on 10 and 155 DF, p-value: < 2.2e-16
All predictors’ impacts are significant, if looking at direct and total impacts. Results show particular significance for:
Also, a cumulative increase will happen by changing neighbourhoods if considering an increase in terms of mean of students per school and number of students, with lower impact than the previous three mentioned predictors.
impSAR <- impacts(SAR, listw=dnb16.list, R=100)
summary(impSAR, zstats=TRUE, short=TRUE)
## Impact measures (lag, exact):
## Direct Indirect Total
## Stud_Pop_under_20 7.01020865 0.753670577 7.76387923
## Scuole 0.84432172 0.090773394 0.93509511
## Classi -0.37122509 -0.039910570 -0.41113566
## Media_stud_scuola 0.04068585 0.004374154 0.04506001
## Studenti 0.01380133 0.001483787 0.01528511
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
## Direct Indirect Total
## Stud_Pop_under_20 0.452330015 0.691290352 0.923475851
## Scuole 0.128019566 0.080993591 0.151433385
## Classi 0.047550669 0.034698936 0.058281087
## Media_stud_scuola 0.004460688 0.003931098 0.005611523
## Studenti 0.002415223 0.001294252 0.002824999
##
## Simulated z-values:
## Direct Indirect Total
## Stud_Pop_under_20 15.457710 1.131066 8.418067
## Scuole 6.598456 1.124764 6.179814
## Classi -7.800642 -1.158092 -7.053923
## Media_stud_scuola 9.205845 1.128736 8.108596
## Studenti 5.700484 1.153396 5.402028
##
## Simulated p-values:
## Direct Indirect Total
## Stud_Pop_under_20 < 2.22e-16 0.25803 < 2.22e-16
## Scuole 4.1546e-11 0.26069 6.4177e-10
## Classi 6.2172e-15 0.24683 1.7395e-12
## Media_stud_scuola < 2.22e-16 0.25901 4.4409e-16
## Studenti 1.1947e-08 0.24875 6.5892e-08
In this second case, the p-value is higher than before, but still implying significance mainly for students over population under \(20\) (increase), number of schools (increase) and number of classes (decrease).
#For the SD specification:
impSDM <- impacts(SDM, listw=dnb16.list, R=100)
summary(impSDM, zstats=TRUE, short=TRUE)
## Impact measures (mixed, exact):
## Direct Indirect Total
## Stud_Pop_under_20 7.09719669 -0.44619729 6.65099940
## Scuole 0.79955069 0.38062945 1.18018014
## Classi -0.36095370 -0.19793100 -0.55888471
## Media_stud_scuola 0.03870830 0.01274642 0.05145472
## Studenti 0.01352731 0.00826724 0.02179455
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
## Direct Indirect Total
## Stud_Pop_under_20 0.420075417 1.053428744 1.018988807
## Scuole 0.137524385 0.322052400 0.326114072
## Classi 0.047700732 0.171224404 0.165702318
## Media_stud_scuola 0.003614372 0.007334518 0.007246858
## Studenti 0.002403222 0.008611326 0.008526823
##
## Simulated z-values:
## Direct Indirect Total
## Stud_Pop_under_20 16.850589 -0.5012483 6.428421
## Scuole 5.825831 1.3514950 3.791453
## Classi -7.599245 -1.1080040 -3.332524
## Media_stud_scuola 10.714155 1.8933487 7.259937
## Studenti 5.655192 0.8653011 2.467750
##
## Simulated p-values:
## Direct Indirect Total
## Stud_Pop_under_20 < 2.22e-16 0.616196 1.2894e-10
## Scuole 5.6829e-09 0.176537 0.00014977
## Classi 2.9754e-14 0.267860 0.00086062
## Media_stud_scuola < 2.22e-16 0.058312 3.8725e-13
## Studenti 1.5567e-08 0.386874 0.01359651
Focusing on the Lagrange multiplier diagnostics for spatial dependence, let’s try focusing on the same model but reducing the number of predictors to those with higher impacts on both SEM and SDM models.
According to the summary, all statistics are significant, except for Robust Linear Model test for error dependence (RLMerr).
OLSmodel <- lm(Media_stud_classe ~ Stud_Pop_under_20 + Scuole + Classi, data = tn, na.action = na.ignore)
natOLSlmTests <- lm.LMtests(OLSmodel, dnb16.list,
test=c("LMerr", "LMlag", "RLMerr", "RLMlag"))
summary(natOLSlmTests)
## Lagrange multiplier diagnostics for spatial dependence
## data:
## model: lm(formula = Media_stud_classe ~ Stud_Pop_under_20 + Scuole +
## Classi, data = tn, na.action = na.ignore)
## weights: dnb16.list
##
## statistic parameter p.value
## LMerr 12.82227 1 0.0003425 ***
## LMlag 23.97668 1 9.751e-07 ***
## RLMerr 0.38852 1 0.5330771
## RLMlag 11.54293 1 0.0006801 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The selection strategy of the above mentioned models proposed by Elhorst (2010) is to:
Estimate the OLS model and test (with the LM test) whether the SL or the SEM is more appropriate to describe the data;
If the OLS model is rejected in favour of the SL, the SEM or in favour of both models, then the SDM should be estimated;
Likelihood ratio (LR) tests can subsequently be used to test whether:
If both hypotheses are rejected, then the SDM best describes the data. If the hypothesis i) cannot be rejected, then the SLM best describes the data, provided that the (robust) LM tests also pointed to the SLM
If the hypothesis ii) cannot be rejected, then the SEM best describes
the data, provided that the (robust) LM tests also pointed to the SEM.
We can perform LR tests of restrictions on the parameters of spatial
models using the function anova().
# Test hypothesis i)
anova(SAR, SDM)
## Model df AIC logLik Test L.Ratio p-value
## SAR 1 8 743.02 -363.51 1
## SDM 2 13 743.60 -358.80 2 9.4117 0.093727
# Test hypothesis ii)
anova(SDM, SEM)
## Model df AIC logLik Test L.Ratio p-value
## SDM 1 13 743.60 -358.80 1
## SEM 2 8 743.86 -363.93 2 10.257 0.068261
#
anova(SDEM, SEM)
## Model df AIC logLik Test L.Ratio p-value
## SDEM 1 13 740.73 -357.36 1
## SEM 2 8 743.86 -363.93 2 13.133 0.022161
Tobler, Waldo R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46 (2): 234–40. http://www.geog.ucsb.edu/~tobler/publications/pdf_docs/A-Computer-Movie.pdf.↩︎
Wikipedia page of Centroid, https://en.wikipedia.org/wiki/Centroid↩︎
How to find the centre of a polygon i Python, https://deparkes.co.uk/2015/02/28/how-to-find-the-centre-of-a-polygon-in-python/↩︎
saveGIF() documentation https://www.rdocumentation.org/packages/animation/versions/2.4.1/topics/saveGIF↩︎
Spatial regression models documentation, https://rspatial.org/raster/analysis/7-spregression.html↩︎
Spatial Autocorrelation, https://ibis.geog.ubc.ca/courses/geob479/notes/spatial_analysis/spatial_autocorrelation.htm#↩︎
LeSage, J., and R.K. Pace. 2009. Introduction to Spatial Econometrics. Statistics: A Series of Textbooks and Monographs. CRC Press↩︎
LeSage, J. 2014. “What Regional Scientists Need to Know About Spatial Econometrics.” Review of Regional Studies 44 (1): 13–32↩︎
https://cran.r-project.org/web/packages/spatialreg/spatialreg.pdf↩︎
Roger Bivand, Gianfranco Piras (2015). Comparing Implementations of Estimation Methods for Spatial Econometrics. Journal of Statistical Software, 63(18), 1-36. https://www.jstatsoft.org/v63/i18/↩︎